Abstract

Semantic segmentation is the task of assigning a class label to each
pixel in an image. We propose a region-based semantic segmentation
framework which handles both full and weak supervision, and addresses three
common problems: (1) Objects occur at multiple scales and therefore we
should use regions at multiple scales. However, these regions overlap,
which creates conflicting class predictions at the pixel-level. (2) Class
frequencies are highly imbalanced in realistic datasets. (3) Each pixel
can only be assigned to a single class, which creates competition between
classes. We address all three problems with a joint calibration method
which optimizes a multi-class loss defined over the final pixel-level output
labeling, as opposed to simply region classification. Our method
outperforms the state-of-the-art on the popular SIFT Flow [lEE] dataset in
both the fully and weakly supervised setting by a considerable margin (+6%
and +10%, respectively).

1 Introduction 

Semantic segmentation is the task of assigning a class label to each pixel in an 
image (Fig. 1). In the fully supervised setting, we have ground-truth labels for 
all pixels in the training images. In the weakly supervised setting, class-labels 
are only given at the image-level. We tackle both settings in a single framework 
which builds on region-based classification. 

Our framework addresses three important problems common to region-based 
semantic segmentation. First of all, objects naturally occur at different scales 
within an image [□, E3]. Performing recognition at a single scale inevitably
leads to regions covering only parts of an object which may have ambiguous
appearance, such as wheels or fur, and to regions straddling multiple objects,
whose classification is harder due to their mixed appearance. Therefore many
recent methods operate on pools of regions computed at multiple scales, which
have a much better chance of containing some regions covering complete
objects [□, i, O, O, [IB, m, E3]. However, this leads to overlapping regions which
may lead to conflicting class predictions at the pixel-level. These conflicts need
to be properly resolved.

Figure 1: Semantic segmentation is the task of assigning class labels to all pixels
in the image. During training, with full supervision we have ground-truth labels
of all pixels. With weak supervision we only have labels at the image-level.

Secondly, classes are often unbalanced [□, □, El, O, EB, EE, E3, HD, E3, E3, 
O, S3, S3]: “cars” and “grass” are frequently found in images while “tricycles” 
and “gravel” are much rarer. Due to the nature of most classifiers, without careful
consideration these rare classes are largely ignored: even if the class occurs in an 
image the system will rarely predict it. Since class-frequencies typically follow 
a power-law distribution, this problem becomes increasingly important with the 
modern trend towards larger datasets with more and more classes. 

Finally, classes compete: a pixel can only be assigned to a single class (e.g.
it cannot belong to both “sky” and “airplane”). To properly resolve such
competition, a semantic segmentation framework should take into account predictions
for multiple classes jointly. 

In this paper we address these three problems with a joint calibration method
over an ensemble of SVMs, where the calibration parameters are optimized over
all classes, and for the final evaluation criterion, i.e. the accuracy of
pixel-level labeling, as opposed to simply region classification. While each SVM is
trained for a single class, their joint calibration deals with the competition
between classes. Furthermore, the criterion we optimize for explicitly accounts for
class imbalance. Finally, competition between overlapping regions is resolved
through maximization: each pixel is assigned the highest scoring class over all
regions covering it. We jointly calibrate the SVMs for optimal pixel labeling
after this maximization, which effectively takes into account conflict resolution
between overlapping regions. Experiments on the popular SIFT Flow [EE] dataset
show a considerable improvement over the state-of-the-art in both the fully and
weakly supervised setting (+6% and +10%, respectively).

2 Related work 

Early works on semantic segmentation used pixel- or patch-based features over
which they defined a Conditional Random Field (CRF) [ED, EE]. Many modern
successful works use region-level representations, both in the fully
supervised [El, i, 0, O, O, EB, ED, El, EE, E9, E3, El, EB, S3] and weakly
supervised [E3, SD, O, S3, SI, S3] settings. A few recent works use CNNs to learn a
direct mapping from image to pixel labels [□, O, El, E3, E3, E3, EE, E9, EID, SB],
although some of them [□, EE, E9] use region-based post-processing to impose
label smoothing and to better respect object boundaries. Other recent works use
CRFs to refine the CNN pixel-level predictions [□, O, O, IZ3, S3]. In this work
we focus on region-based semantic segmentation, which we discuss in light of
the three problems raised in the introduction.

Overlapping regions. Traditionally, semantic segmentation systems use
superpixels [ffl, 0, ED, EE, E3, E3, El, EE, S3], which are non-overlapping regions
resulting from a single-scale oversegmentation. However, appearance-based
recognition of superpixels is difficult as they typically capture only parts of
objects, rather than complete objects. Therefore, many recent methods use
overlapping multi-scale regions [□, i, O, O, DIE, El, S3]. However, these may lead
to conflicting class predictions at the pixel-level. Carreira et al. [i] address this
simply by taking the maximum score over all regions containing a pixel. Both
Hariharan et al. [O] and Girshick et al. [O] use non-maximum suppression,
which may cause problems for nearby or interacting objects [DIE]. Li et al. [im]
predict class overlap scores for each region at each scale. Then they create
superpixels by intersecting all regions. Finally, they assign overlap scores to these
superpixels using maximum composite likelihood (i.e. taking all multi-scale
predictions into account). Plath et al. [E3] use classification predictions over a
segmentation hierarchy to induce label consistency between parent and child
regions within a tree-based CRF framework. After solving their CRF formulation,
only the smallest regions (i.e. leaf-nodes) are used for class prediction. In the
weakly supervised setting, most works use superpixels [E3, SD, O, O] and so
do not encounter problems of conflicting predictions. Zhang et al. [S3] use
overlapping regions to enforce a form of class-label smoothing, but they all have the
same scale. A different Zhang et al. [S3] use overlapping region proposals at
multiple scales in a CRF.

Class imbalance. As the PASCAL VOC dataset [B] is relatively balanced, 
most works that experiment on it did not explicitly address this issue [ffl, i, Q, 
O, O, HE, O, [E3, E3, E3, E3, SE]. On highly imbalanced datasets such as SIFT
Flow [DIE], Barcelona [E3] and LM+SUN [E3], rare classes pose a challenge.
This is observed and addressed by Tighe et al. [E3] and Yang et al. [S3]: for a
test image, only a few training images with similar context are used to provide
class predictions, but for rare classes this constraint is relaxed and more
training images are used. Vezhnevets et al. [E3] balance rare classes by normalizing
scores for each class to range [0,1]. A few works [EE, O, S3] balance classes by 
using an inverse class frequency weighted loss function. 

Competing classes. Several works train one-vs-all classifiers separately and
resolve labeling through maximization [3, O, O, HE, EE, E3, E3, E3]. This
is suboptimal since the scores of different classes may not be properly
calibrated. Instead, Tighe et al. [E3, E3] and Yang et al. [i3] use Nearest
Neighbor classification which is inherently multi-class. In the weakly supervised
setting appearance models are typically trained in isolation and remain
uncalibrated [EE, EE, O, S3, S3]. To the best of our knowledge, Boix et al. [ffl] is the
only work in semantic segmentation to perform joint calibration of SVMs. While
this makes it possible to handle competing classes, their work uses non-overlapping
regions. In contrast, in our work we use overlapping regions where conflicting
predictions are resolved through maximization. In this setting, joint calibration
is particularly important, as we will show in Sec. 4. As another difference, Boix
et al. [ffl] address only full supervision whereas we address both full and weak
supervision in a unified framework.

3 Method 

3.1 Model 

We represent an image by a set of overlapping regions [E3] described by CNN
features [O] (Sec. 3.4). Our semantic segmentation model infers the label $o_p$ of
each pixel $p$ in an image:

$$o_p = \operatorname*{argmax}_{c,\; r \ni p} \; \sigma(w_c \cdot x_r, a_c, b_c) \qquad (1)$$

As appearance models, we have a separate linear SVM per class $c$. These
SVMs score the features $x_r$ of each region $r$. The scores are calibrated by a
sigmoid function $\sigma$, with different parameters $a_c, b_c$ for each class $c$. The argmax
returns the class $c$ with the highest score over all regions that contain pixel $p$.
This involves maximizing over classes for a region, and over the regions that
contain $p$.
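
As an illustration, the inference rule of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal sketch, not the implementation used in the experiments: the array layout (a dense region-by-pixel mask) and the names `calibrated_scores` and `label_pixels` are assumptions made for the example, and the sigmoid follows the form of Eq. (2).

```python
import numpy as np

def calibrated_scores(svm_scores, a, b):
    """Per-class sigmoid calibration: 1 / (1 + exp(a_c * s + b_c))."""
    return 1.0 / (1.0 + np.exp(a * svm_scores + b))

def label_pixels(svm_scores, region_masks, a, b):
    """For each pixel, take the argmax over classes and covering regions.

    svm_scores:   (R, C) raw SVM outputs w_c . x_r
    region_masks: (R, P) boolean, True where region r contains pixel p
    a, b:         (C,) calibration parameters (hypothetical layout)
    """
    probs = calibrated_scores(svm_scores, a, b)              # (R, C)
    # Broadcast to (R, P, C); regions that do not cover a pixel get -inf.
    per_pixel = np.where(region_masks[:, :, None], probs[:, None, :], -np.inf)
    best_over_regions = per_pixel.max(axis=0)                # (P, C)
    return best_over_regions.argmax(axis=1)                  # (P,)

# Toy example: region 0 covers all three pixels, region 1 only the last one.
scores = np.array([[2.0, -1.0], [-1.0, 3.0]])
masks = np.array([[True, True, True], [False, False, True]])
labels = label_pixels(scores, masks, a=np.array([-1.0, -1.0]), b=np.zeros(2))
```

In the toy example the small region wins on the last pixel because its calibrated score for class 1 exceeds every score of the large region. Changing `b` for class 1 can flip that outcome, which is the effect the joint calibration exploits.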

During training we find the SVM parameters $w_c$ (Sec. 3.2) and calibration
parameters $a_c$ and $b_c$ (Sec. 3.3). The training of the calibration parameters takes
into account the effects of the two maximization operations, as they are
optimized for the output pixel-level labeling performance (as opposed to simply
accuracy in terms of region classification).

3.2 SVM training 

Fully supervised. In this setting we are given ground-truth pixel-level labels 
for all images in the training set (Fig. 1). This leads to a natural subdivision 
into ground-truth regions, i.e. non-overlapping regions perfectly covering a
single class. We use these as positive training samples. However, such idealized
samples are rarely encountered at test time since there we have only imperfect
region proposals [E3]. Therefore we use as additional positive samples for a class
all region proposals which overlap heavily with a ground-truth region of that
class (i.e. Intersection-over-Union greater than 50% [0]). As negative samples,
we use all regions from all images that do not contain that class. In the SVM 
loss function we apply inverse frequency weighting in terms of the number of 
positive and negative samples. 
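
The selection of additional positive samples by Intersection-over-Union can be illustrated as follows. This is a sketch under the assumption that regions are stored as boolean masks; the helper names `iou` and `positive_proposals` are hypothetical, and only the 50% threshold is taken from the text.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection-over-Union of two boolean masks of equal shape."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union > 0 else 0.0

def positive_proposals(proposals, gt_regions, threshold=0.5):
    """Return (proposal index, class) pairs whose IoU with a ground-truth
    region of that class exceeds the threshold.

    proposals:  list of boolean masks
    gt_regions: list of (boolean mask, class id) pairs
    """
    positives = []
    for i, prop in enumerate(proposals):
        for gt_mask, gt_class in gt_regions:
            if iou(prop, gt_mask) > threshold:
                positives.append((i, gt_class))
    return positives

# Toy 1-D "image" with six pixels and one ground-truth region of class 3.
gt = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
props = [np.array([1, 1, 1, 0, 0, 0], dtype=bool),   # IoU 3/4, kept
         np.array([0, 0, 0, 1, 1, 1], dtype=bool)]   # IoU 1/6, rejected
selected = positive_proposals(props, [(gt, 3)])
```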

Weakly supervised. In this setting we are only given image-level labels on the 
training images (Fig. 1). Hence, we treat region-level labels as latent variables 
which are updated using an alternated optimization process (as in [E3, BD, O, 
O, B3]). To initialize the process, we use as positive samples for a class all
regions in all images containing it. At each iteration we alternate between training
SVMs based on the current region labeling and updating the labeling based on
the current SVMs (by assigning to each region the label of the highest scoring
class). In this process we keep our negative samples constant, i.e. all regions
from all images that do not contain the target class. In the SVM loss function
we apply inverse frequency weighting in terms of the number of positive and
negative samples.

Figure 2: The first row shows multiple region proposals (left) extracted from
an image (right). The following rows show the per-class SVM scores of each
region (left) and the pixel-level labeling (right). Row 2 shows the results before
and row 3 after joint calibration.

3.3 Joint Calibration 

We now introduce our joint calibration procedure, which addresses three
common problems in semantic segmentation: (1) conflicting predictions of
overlapping regions, (2) class imbalance, and (3) competition between classes.

To better understand the problem caused by overlapping regions, consider 
the example of Fig. 2. It shows three overlapping regions, each with different
class predictions. The final goal of semantic segmentation is to output a
pixel-level labeling, which is evaluated in terms of pixel-level accuracy. In our
framework we employ a winner-takes-all principle: each pixel takes the class of
the highest scored region which contains it. Now, using uncalibrated SVMs is 
problematic (second row in Fig. 2). SVMs are trained to predict class labels at 
the region-level, not the pixel-level. However, different regions have different 
area, and, most importantly, not all regions contribute all of their area to the final 
pixel-level labeling: Predictions of small regions may be completely suppressed 
by bigger regions (e.g. in Fig. 2, row 3, the inner-boat region is suppressed by the 
prediction of the complete boat). In other cases, bigger regions may be partially 
overwritten by smaller regions (e.g. in Fig. 2 the boat region partially overwrites 
the prediction of the larger boat+sky region). Furthermore, the SVMs are trained
in a one-vs-all manner and are unaware of other classes. Hence they are unlikely 
to properly resolve competition between classes even within a single region. The 
problems above show that without calibration, the SVMs are optimized for the 
wrong criterion. We propose to jointly calibrate SVMs for the correct criterion, 
which corresponds better to the evaluation measure typically used for semantic 
segmentation (i.e. pixel labeling accuracy averaged over classes). We do this by
applying sigmoid functions $\sigma$ to all SVM outputs:

$$\sigma(w_c \cdot x_r, a_c, b_c) = \left(1 + \exp(a_c \, w_c \cdot x_r + b_c)\right)^{-1} \qquad (2)$$


where $a_c, b_c$ are the calibration parameters for class $c$. We calibrate the
parameters of all classes jointly by minimizing a loss function $\mathcal{L}(o, l)$, where $o$ is the
pixel labeling output of our method on the full training set ($o = \{o_p;\ p = 1 \ldots P\}$)
and $l$ the ground-truth labeling.

We emphasize that the pixel labeling output $o$ is the result after the
maximization over classes and regions in Eq. (1). Since we optimize for the accuracy
of this final output labeling, and we do so jointly over classes, our calibration
procedure takes into account both problems of conflicting class predictions
between overlapping regions and competition between classes. Moreover, we also
address the problem of class imbalance, as we compensate for it in our loss 
functions below. 

Fully supervised loss. In this setting our loss directly evaluates the desired
performance measure, which is typically pixel labeling accuracy averaged over
classes [□, [E3, ES, E3, i3]:

$$\mathcal{L}(o, l) = 1 - \frac{1}{C} \sum_{c=1}^{C} \frac{1}{P_c} \sum_{p;\, l_p = c} [l_p = o_p] \qquad (3)$$

where $l_p$ is the ground-truth label of pixel $p$, $o_p$ is the output pixel label, $P_c$ is
the number of pixels with ground-truth label $c$, and $C$ is the number of classes.
$[\cdot]$ is 1 if the condition is true and 0 otherwise. The inverse frequency weighting
factor $1/P_c$ deals with class imbalance.
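
To make the effect of the inverse frequency weighting concrete, the class-averaged loss described above can be sketched as one minus the mean of the per-class pixel accuracies. The function name `class_average_loss` is hypothetical, and skipping classes that do not occur in the ground truth is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def class_average_loss(output, ground_truth, num_classes):
    """One minus the pixel accuracy averaged over classes.

    output, ground_truth: 1-D integer arrays of per-pixel labels.
    """
    accuracies = []
    for c in range(num_classes):
        mask = ground_truth == c
        if mask.sum() == 0:
            continue  # assumption: classes absent from the ground truth are skipped
        accuracies.append((output[mask] == c).mean())
    return 1.0 - float(np.mean(accuracies))

# Four pixels of a frequent class, one pixel of a rare class that is missed.
gt = np.array([0, 0, 0, 0, 1])
out = np.array([0, 0, 0, 0, 0])
loss = class_average_loss(out, gt, num_classes=2)
```

In this toy case plain pixel accuracy is 80%, but the class-averaged accuracy is only 50% (loss 0.5), because the single rare-class pixel counts as much as the four frequent-class pixels together.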

Weakly supervised loss. Also in this setting the performance measure is
typically class-average pixel accuracy [E3, SQ, S3, S3]. Since we do not have
ground-truth pixel labels, we cannot directly evaluate it. We do however have a set
of ground-truth image labels $l_i$ which we can compare against. We first
aggregate the output pixel labels $o_p$ over each image $m_i$ into output image labels
$o_i = \bigcup_{p \in m_i} o_p$. Then we define as loss the difference between the ground-truth
label set $l_i$ and the output label set $o_i$, measured by the Hamming distance between
their binary vector representations

$$\mathcal{L}(o, l) = \sum_{i=1}^{I} \sum_{c=1}^{C} \frac{1}{I_c} \, |l_{i,c} - o_{i,c}| \qquad (4)$$

where $l_{i,c} = 1$ if label $c$ is in $l_i$, and 0 otherwise (analogously for $o_{i,c}$). $I$ is the total
number of training images, $I_c$ is the number of images having ground-truth label
$c$, so the loss is weighted by the inverse frequency of class labels, measured at
the image-level. Note how also in this setting the loss looks at performance after
the maximization over classes and regions (Eq. (1)).
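
The image-level loss can be illustrated with a short sketch: pixel labels are first aggregated into a binary label vector per image, and the vectors are then compared with a Hamming distance weighted by inverse class frequency. The function names and the dense binary-matrix layout are assumptions made for this example.

```python
import numpy as np

def image_labels_from_pixels(pixel_labels, num_classes):
    """Set-union aggregation: binary vector of the classes present in one image."""
    present = np.zeros(num_classes, dtype=int)
    present[np.unique(pixel_labels)] = 1
    return present

def weak_loss(pred_labels, gt_labels):
    """Hamming distance weighted by the inverse image-level class frequency.

    pred_labels, gt_labels: (I, C) binary integer matrices of image-level labels.
    """
    images_per_class = gt_labels.sum(axis=0)          # number of images per class
    diff = np.abs(gt_labels - pred_labels)
    valid = images_per_class > 0                      # skip classes with no images
    return float((diff[:, valid] / images_per_class[valid]).sum())

# Three images, two classes; class 1 occurs in only one ground-truth image.
gt = np.array([[1, 1], [1, 0], [1, 0]])
pred = np.array([[1, 0], [1, 0], [1, 1]])
loss = weak_loss(pred, gt)   # two errors on class 1, each weighted by 1/1
```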

Optimization. We want to minimize our loss functions over the calibration
parameters $a_c, b_c$ of all classes. This is hard, because the output pixel labels $o_p$
depend on these parameters in a complex manner due to the max over classes
and regions in Eq. (1), and because of the set-union aggregation in the case of
the weakly supervised loss. Therefore, we apply an approximate minimization
algorithm based on coordinate descent. Coordinate descent is different from
gradient descent in that it can be used on arbitrary loss functions that are not
differentiable, as it only requires their evaluation for a given setting of parameters.

Figure 3: Our efficient evaluation algorithm uses the bottom-up structure of
Selective Search region proposals to simplify the spatial maximization. We start
from the root node and propagate the maximum score with its corresponding
label down the tree. We label the image based on the labels of its superpixels (leaf
nodes).

Coordinate descent iteratively applies line search to optimize the loss over a
single parameter at a time, keeping all others fixed. This process cycles through
all parameters until convergence. As initialization we use constant values
($a_c = -1$, $b_c = 0$). During line search we consider 10 equally spaced values
($a_c$ in $[-12, -2]$, $b_c$ in $[-10, 10]$).

This procedure is guaranteed to converge to a local minimum on the search 
grid. While this might not be the global optimum, in repeated trials we found the 
results to be rather insensitive to initialization. Furthermore, in our experiments 
the number of iterations was roughly proportional to the number of parameters. 
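
The procedure described above can be sketched as a generic grid-based coordinate descent. This is a minimal illustration, not the authors' code: the function name `coordinate_descent`, the `max_sweeps` safeguard, and the toy loss are assumptions. Only the idea of a per-parameter line search over a fixed grid is taken from the text.

```python
import numpy as np

def coordinate_descent(loss_fn, init, grids, max_sweeps=20):
    """Minimise loss_fn over a grid, one parameter at a time.

    loss_fn: callable taking a 1-D parameter array and returning a float
    init:    initial parameter values
    grids:   list of 1-D arrays with the candidate values of each parameter
    """
    params = np.array(init, dtype=float)
    best = loss_fn(params)
    for _ in range(max_sweeps):
        improved = False
        for k, grid in enumerate(grids):
            for value in grid:                 # line search over one coordinate
                trial = params.copy()
                trial[k] = value
                loss = loss_fn(trial)
                if loss < best:                # accept strict improvements only
                    best, params, improved = loss, trial, True
        if not improved:                       # a full sweep changed nothing
            break
    return params, best

# Toy non-differentiable loss with its minimum at (3, -2).
toy_loss = lambda p: abs(p[0] - 3.0) + abs(p[1] + 2.0)
grids = [np.linspace(0, 9, 10), np.linspace(-5, 4, 10)]
params, best = coordinate_descent(toy_loss, [0.0, 0.0], grids)
```

For the calibration itself, the grids would be `np.linspace(-12, -2, 10)` for each $a_c$ and `np.linspace(-10, 10, 10)` for each $b_c$, with `loss_fn` evaluating the pixel-level loss after the maximization of Eq. (1). Because only strict improvements are accepted and the grid is finite, the loop terminates at a point that no single-parameter change on the grid can improve.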

Efficient evaluation. On a typical training set with C = 30 classes, our joint 
calibration procedure evaluates the loss thousands of times. Hence, it is
important to evaluate pixel-level accuracy quickly. As the model involves a maximum
over classes and a maximum over regions at every pixel, a naive per-pixel
implementation would be prohibitively expensive. Instead, we propose an efficient
technique that exploits the nature of the Selective Search region proposals [E3],
which form a bottom-up hierarchy starting from superpixels. As shown in Fig. 3,
we start from the region proposal that contains the entire image (root node).
Then we propagate the maximum score over all classes down the region
hierarchy. Eventually we assign to each superpixel (leaf nodes) the label with the
highest score over all regions that contain it. This label is assigned to all pixels
in the superpixel. To compute class-average pixel accuracy, we normally need
to compare each pixel label to the ground-truth label. However since we assign
the same label to all pixels in a superpixel, we can precompute the ground-truth
label distribution for each superpixel and use it as a lookup table. This reduces
the runtime complexity for an image from $O(P_i \cdot R_i \cdot C)$ to $O(R_i \cdot C)$, where $P_i$ and
$R_i$ are the number of pixels and regions in an image respectively, and $C$ is the
number of classes.
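
The top-down propagation can be sketched as a single traversal of the region tree. The dictionary-based tree representation and the name `label_superpixels` are assumptions of this example. It handles a single hierarchy and takes as input the best (score, label) pair of each region, i.e. the maximum over classes already computed per region.

```python
def label_superpixels(children, root, region_best):
    """Propagate the best (score, label) pair from the root down to the leaves.

    children:    dict mapping each node to a list of child nodes ([] for leaves)
    root:        the node covering the whole image
    region_best: dict mapping each node to (score, label), its max over classes
    Returns a dict mapping each leaf (superpixel) to the label of the
    highest-scoring region that contains it.
    """
    leaf_labels = {}
    stack = [(root, region_best[root])]
    while stack:
        node, inherited = stack.pop()
        # Tuples compare by score first, so max() keeps the higher-scoring pair.
        best = max(inherited, region_best[node])
        if not children[node]:
            leaf_labels[node] = best[1]
        for child in children[node]:
            stack.append((child, best))
    return leaf_labels

# Toy hierarchy: image -> {A -> {s1, s2}, B -> {s3}}.
children = {"img": ["A", "B"], "A": ["s1", "s2"], "B": ["s3"],
            "s1": [], "s2": [], "s3": []}
region_best = {"img": (0.6, 0), "A": (0.8, 1), "B": (0.3, 2),
               "s1": (0.9, 3), "s2": (0.2, 4), "s3": (0.1, 5)}
labels = label_superpixels(children, "img", region_best)
```

Each node is visited once, so the cost per image is linear in the number of regions once the per-region maximum over classes is available, in line with the complexity stated above.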

Why no Platt scaling. At this point the reader may wonder why we do not 
simply use Platt scaling [IZ3] as is commonly done in many applications. Platt 
scaling is used to convert SVM scores to range [0,1] using sigmoid functions, 
as in Eq. (2). However, in Platt scaling the parameters $a_c, b_c$ are optimized for
each class in isolation, ignoring class competition. The loss function $\mathcal{L}_c$ in Platt
scaling is the cross-entropy function

$$\mathcal{L}_c(\sigma_c, t) = -\sum_{r} \left[ t_{r,c} \log(\sigma_c(x_r)) + (1 - t_{r,c}) \log(1 - \sigma_c(x_r)) \right] \qquad (5)$$

where $N_+$ is the number of positive samples, $N_-$ the number of negative samples,
and $t_{r,c} = \frac{N_+ + 1}{N_+ + 2}$ if $l_r = c$ or $t_{r,c} = \frac{1}{N_- + 2}$ otherwise; $l_r$ is the region-level label. This
loss function is inappropriate for semantic segmentation because it is defined in 
terms of accuracy of training samples, which are regions, rather than in terms 
of the final pixel-level accuracy. Hence it ignores the problem of overlapping 
regions. There is also no inverse frequency term to deal with class imbalance. 
We experimentally compare our method with Platt scaling in Sec. 4. 
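
For comparison, the per-class Platt objective can be sketched as below. The target values follow the standard Platt scaling formulation, and the function names are hypothetical. Note that the loss sums over regions of a single class only, so it sees neither the other classes nor the pixel-level maximization.

```python
import numpy as np

def platt_targets(is_positive):
    """Smoothed targets of standard Platt scaling for one class.

    is_positive: boolean array, True for regions labeled with this class.
    """
    n_pos = int(is_positive.sum())
    n_neg = int((~is_positive).sum())
    return np.where(is_positive, (n_pos + 1) / (n_pos + 2), 1 / (n_neg + 2))

def platt_loss(svm_scores, is_positive, a, b):
    """Cross-entropy between the targets and the calibrated region scores."""
    t = platt_targets(is_positive)
    p = 1.0 / (1.0 + np.exp(a * svm_scores + b))      # sigmoid in the form of Eq. (2)
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)).sum())

# Two positive regions and one negative region for a single class.
is_pos = np.array([True, True, False])
scores = np.array([2.0, 1.0, -1.0])
loss = platt_loss(scores, is_pos, a=-1.0, b=0.0)
```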


3.4 Implementation Details 

Region proposals. We use Selective Search [E3] region proposals using a
subset of the “Fast” mode: we keep the similarity measures, but we restrict the scale
parameter k to 100 and the color-space to RGB. This leads to two bottom-up 
hierarchies of one initial oversegmentation [S]. 

Features. We show experiments with features generated by two CNNs (AlexNet [O], 
VGG16 [E3]) using the Caffe implementations [O]. We use the R-CNN [O] 
framework for AlexNet, and Fast R-CNN (FRCN) [DU] for VGG16, in order to 
maintain high computational efficiency. Regions are described using all pixels in 
a tight bounding box. Since regions are free-form, Girshick et al. [O]
additionally propose to set pixels not belonging to the region to zero (i.e. not affecting
the convolution). However, in our experiments this did not improve results so
we do not use it. For the weakly supervised setting we use the CNNs pre-trained
for image classification on ILSVRC 2012 [123]. For the fully supervised setting
we finetune them on the training set of SIFT Flow [03] (i.e. the semantic
segmentation dataset we experiment on). For both settings, following [O] we use
the output of the fc6 layer of the CNN as features. 

SVM training. Like [O] we set the regularization parameter C to a fixed value
in all our experiments. The SVMs minimize the L2-loss for region classification.
We use hard-negative mining to reduce memory consumption.


4 Experiments 

Datasets. We evaluate our method on the challenging SIFT Flow dataset [HE].
It consists of 2488 training and 200 test images, pixel-wise annotated with 33
class labels. The class distribution is highly imbalanced in terms of overall
region count as well as pixel count. As evaluation measure we use the popular
class-average pixel accuracy [□, O, IZ3, 123, E]l, EB, SD, S3, S3, S3]. For both
supervision settings we report results on the test set.

Fully supervised setting

Method                              Class Acc.
Byeon et al. [□]                    22.6%
Tighe et al. [E3]                   29.1%
Pinheiro et al. [123]               30.0%
Shuai et al. [ED]                   39.7%
Tighe et al. [E3]                   41.1%
Kekeç et al. [O]                    45.8%
Sharma et al. [EE]                  48.0%
Yang et al. [E3]                    48.7%
George et al. [□]                   50.1%
Farabet et al. [□]                  50.8%
Long et al. [O]                     51.7%
Sharma et al. [E3]                  52.8%
Ours SVM (AlexNet)                  28.7%
Ours SVM+PS (AlexNet)               27.7%
Ours SVM+JC (AlexNet)               55.6%
Ours SVM+JC (VGG16)                 59.2%

Weakly supervised setting

Method                              Class Acc.
Vezhnevets et al. [E3]              14.0%
Vezhnevets et al. [ED]              21.0%
Zhang et al. [E3]                   27.7%
Xu et al. [O]                       27.9%
Zhang et al. [E3]                   32.3%
Xu et al. [E3]                      35.0%
Xu et al. [E3] (transductive)       41.4%
Ours SVM (AlexNet)                  21.2%
Ours SVM+PS (AlexNet)               16.8%
Ours SVM+JC (AlexNet)               37.4%
Ours SVM+JC (VGG16)                 44.8%

Table 1: Class-average pixel accuracy in the fully supervised (top) and the
weakly supervised (bottom) setting. We show results for our model on the
test set of SIFT Flow using uncalibrated SVM scores (SVM), traditional Platt
scaling (PS) and joint calibration (JC).

Fully supervised setting. Table 1 evaluates various versions of our model in 
the fully supervised setting, and compares to other works on SIFT Flow. Using 
AlexNet features and uncalibrated SVMs, our model achieves a class-average 
pixel accuracy of 28.7%. If we calibrate the SVM scores with traditional Platt
scaling, results do not improve (27.7%). Using our proposed joint calibration to
maximize class-average pixel accuracy improves results substantially to 55.6%.
This shows the importance of joint calibration to resolve conflicts between
overlapping regions at multiple scales, to take into account competition between
classes, and generally to optimize a loss mirroring the evaluation measure. 

Fig. 4 (column “SVM”) shows that larger background regions (e.g. sky,
building) swallow smaller foreground regions (e.g. boat, awning). Many of these
small objects become visible after calibration (column “SVM+JC”). This issue
is particularly evident when working with overlapping regions. Consider a large
region on a building which contains an awning. As the surface of the awning is
small, the features of the large region will be dominated by the building, leading
to a strong classification score for the ‘building’ class. When this score is higher than
the classification score for ‘awning’ on the small awning region, the latter gets
overwritten. In contrast, this problem does not appear when working with
superpixels [ID]. A superpixel is either part of the building or part of the awning, so
a high scoring awning superpixel cannot be overwritten by neighboring
building superpixels. Hence, joint calibration is particularly important when working
with overlapping regions.

Regions      Class Acc.
FH [S]       43.4%
SS [El]      55.6%

Table 2: Comparison of single-scale (FH) and multi-scale (SS) regions using
SVM+JC (AlexNet).

Finetuned    Class Acc.
no           49.4%
yes          55.6%

Table 3: Effect of CNN finetuning in the fully supervised setting using
SVM+JC (AlexNet).

Using the deeper VGG16 CNN the results improve further, leading to our
final performance of 59.2%. This outperforms the state-of-the-art [123] by 6.4%.

Weakly supervised setting. Table 1 shows results in the weakly supervised 
setting. The model with AlexNet and uncalibrated SVMs achieves an accuracy 
of 21.2%. Using traditional Platt scaling the result is 16.8%, again showing it 
is not appropriate for semantic segmentation. Instead, our joint calibration
almost doubles accuracy (37.4%). Using the deeper VGG16 CNN results improve
further to 44.8%. 

Fig. 5 illustrates the power of our weakly supervised method. Again rare 
classes appear only after joint calibration. Our complete model outperforms the 
state-of-the-art [O] (35.0%) in this setting by 9.8%. Xu et al. [O]
additionally report results on the transductive setting (41.4%), where all (unlabeled) test
images are given to the algorithm during training. 

Region proposals. To demonstrate the importance of multi-scale regions, we 
also analyze oversegmentations that do not cover multiple scales. To this end, we 
keep our framework the same, but instead of Selective Search (SS) [E2I] region 
proposals we use a single oversegmentation using the method of Felzenszwalb
and Huttenlocher (FH) [H] (for which we optimized the scale parameter). As 
Table 2 shows, SS regions outperform FH regions by a good margin of 12.2% in 
the fully supervised setting. This confirms that overlapping multi-scale regions 
are superior to non-overlapping oversegmentations. 

CNN finetuning. As described in Sec. 3.4 we finetune our network for detection in
the fully supervised case. Table 3 shows that this improves results by 6.2%
compared to using a CNN trained only for image classification on ILSVRC 2012.

5 Conclusion 

We addressed three common problems in semantic segmentation based on
region proposals: (1) overlapping regions yield conflicting class predictions at the
pixel-level; (2) class-imbalance leads to classifiers unable to detect rare classes; 
(3) one-vs-all classifiers do not take into account competition between multiple 
classes. We proposed a joint calibration strategy which optimizes a loss defined 
over the final pixel-level output labeling of the model, after maximization over 
classes and regions. This tackles all three problems: joint calibration deals with 
multi-class predictions, while our loss explicitly deals with class imbalance and 
is defined in terms of pixel-wise labeling rather than region classification
accuracy. As a result we take into account conflict resolution between overlapping
regions. Our method outperforms the state-of-the-art in both the fully and the 
weakly supervised setting on the popular SIFT Flow [HE] benchmark. 

Acknowledgements. Work supported by the ERC Starting Grant VisCul.

Figure 4: Fully supervised semantic segmentation on SIFT Flow. We present
uncalibrated SVM results (SVM) and jointly calibrated results (SVM+JC), both
with VGG16.

Figure 5: Weakly supervised semantic segmentation on SIFT Flow. We
present uncalibrated SVM results (SVM) with AlexNet, jointly calibrated
results (SVM+JC) with AlexNet, and with VGG16.